The Analysis
Using the data set provided and outlined below, we’d like you to build a forecast for customer service email volume at the daily level until December 31, 2021. The purpose of the forecast will be used to determine our daily customer service staffing needs based on the email volume of the day.
Load Libraries
The Data
Data Overview
Based on information, here is what we know about our data: The data set contains simulated data for Spotify customer service email volume, subscriptions, and MAUs, but the patterns within are similar to real data. Email, subs, and MAU data are available from January 1, 2019 to May 31, 2021. Forecast subs and MAU were estimated in March 2021. You do not have to use every variable listed for your final model.
Data Import
First we’ll import the data as an excel file
CS_Forecast_Exercise_Data <- read_excel("data/CS Forecast Exercise Data.xlsx",
col_types = c("date", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric"))We’ll double check our date formats and change as needed.
# check date format
# str(CS_Forecast_Exercise_Data)
# change to Date
CS_Forecast_Exercise_Data <- CS_Forecast_Exercise_Data %>% mutate(date = as.Date(date))Next we’ll briefly summarize our data below. We can see that we have 1,096 total records with 12 variables; one date variable and the rest numeric.
| Name | CS_Forecast_Exercise_Data |
| Number of rows | 1096 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2019-01-01 | 2021-12-31 | 2020-07-01 | 1096 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| 214 | 0.80 | 9079 | 2299.08 | 4691 | 7661 | 8814.5 | 9940.5 | 29920 | ▇▃▁▁▁ | |
| subs | 214 | 0.80 | 50948359 | 7276628.25 | 39516647 | 44673674 | 50993731.0 | 56826297.0 | 63146937 | ▇▇▇▇▇ |
| subs_standard | 214 | 0.80 | 22501460 | 2311921.40 | 17823463 | 20564774 | 22252926.5 | 24491684.5 | 26965404 | ▂▅▇▇▂ |
| subs_student | 214 | 0.80 | 3669271 | 169318.01 | 3367301 | 3527783 | 3663944.5 | 3840557.5 | 3915350 | ▃▆▃▃▇ |
| subs_family | 214 | 0.80 | 20738998 | 4015456.70 | 14237991 | 16827842 | 21159042.5 | 24282939.2 | 26770934 | ▇▅▅▆▇ |
| f_subs | 821 | 0.25 | 65675808 | 2453856.74 | 61771410 | 63530832 | 65669332.0 | 67783773.0 | 70300852 | ▇▆▆▇▃ |
| f_subs_standard | 821 | 0.25 | 27037476 | 596568.18 | 26311010 | 26566314 | 26675257.0 | 27502558.5 | 28722821 | ▇▁▅▁▁ |
| f_subs_student | 821 | 0.25 | 3418401 | 23278.42 | 3384047 | 3394838 | 3418636.0 | 3443110.0 | 3453084 | ▇▅▃▅▇ |
| f_subs_family | 821 | 0.25 | 27713511 | 874456.22 | 26288684 | 26917562 | 27674145.0 | 28473306.5 | 29306612 | ▇▆▆▆▆ |
| mau | 214 | 0.80 | 70419337 | 7816536.85 | 58437292 | 62546144 | 70435576.5 | 76536461.8 | 83248627 | ▇▃▅▆▆ |
| f_mau | 821 | 0.25 | 87005037 | 2433158.16 | 82956358 | 85119070 | 86601582.0 | 88799819.5 | 92609723 | ▅▇▆▅▂ |
Time Series Visuals
We’ll use this information to visually inspect date, email, subs, subs_standard, subs_student, subs_family as a collection of time series. What we can see immediately is that subs and sub_family are both fairly linear trends without visually strong seasonal patterns. On the other hand subs_standard exhibits both a linear trend and what looks like monthly seasonality. Meanwhile subs_student shows a steadily increasing volume peaking in early 2020 and steadily declining thereafter.
CS_Forecast_Exercise_Data %>%
select(date, email, subs, subs_standard, subs_student, subs_family) %>%
filter(date <= '2021-05-31') %>%
pivot_longer(!date, names_to="type", values_to="count") %>%
ggplot(aes(x = date, y = count)) +
geom_line() +
facet_wrap(~ type, scales = "free_y") +
scale_y_continuous(label=comma) +
theme_minimal()Next we we’ll want to focus on our outcome variable email to more closely inspect the time series for visual cues that may be helpful in forecasting such as seasonality.
We can see from the initial plot there are various spikes in volume that are over about 20k. Inspecting the dates there doesn’t seem to be a common date occurrence for these spikes. There doesn’t seem to be a strong overall trend either.
CS_Forecast_Exercise_Data %>%
select(date, email) %>%
filter(date <= '2021-05-31') %>%
plot_time_series(date, email)Lets plot out our time series but this time annotate apparent anomalies. Inspecting the dates of apparent anomalies, we can confirm no recurring date pattern, letting us know we may benefit from cleaning up these anomalies.
CS_Forecast_Exercise_Data %>%
select(date, email) %>%
filter(date <= '2021-05-31') %>%
plot_anomaly_diagnostics(date, email)If we now clean up our email time series we can re-inspect the new data visually. We can still confirm a lack of strong trend, but in order to explore seasonality, we’ll need to plot our data in a different manner.
CS_Forecast_Exercise_Data %>%
select(date, email) %>%
filter(date <= '2021-05-31') %>%
mutate(email = ts_clean_vec(email)) %>%
plot_time_series(date, email)In our next plot we’ll explore seasonal diagnostics more closely. Weekly seasonality does seem apparent with Saturday and Sunday representing down days, and the work week a relatively steady plateau of volume.
CS_Forecast_Exercise_Data %>%
select(date, email) %>%
filter(date <= '2021-05-31') %>%
mutate(email = ts_clean_vec(email)) %>%
plot_seasonal_diagnostics(date, email, .interactive = FALSE) +
theme_minimal()Forecast
Next we’ll set up a forecast framework where various models will be trained and evaluated.
First well split our data into training and testing portions. This will allow us to evaluate our models.
# create forecast series
cs_ts <- CS_Forecast_Exercise_Data %>%
select(date, email) %>%
filter(date <= '2021-05-31') %>%
mutate(email = ts_clean_vec(email))
# Split Data 80/20
splits <- initial_time_split(cs_ts, prop = 0.8)Next we’ll create multiple models to evaluate
# Model 1: auto_arima ----
model_fit_arima_no_boost <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(email ~ date, data = training(splits))
# Model 2: arima_boost ----
model_fit_arima_boosted <- arima_boost(
min_n = 2,
learn_rate = 0.015
) %>%
set_engine(engine = "auto_arima_xgboost") %>%
fit(email ~ date + as.numeric(date) + day(date),
data = training(splits))
# Model 3: ets ----
model_fit_ets <- exp_smoothing() %>%
set_engine(engine = "ets") %>%
fit(email ~ date, data = training(splits))
# Model 4: prophet ----
model_fit_prophet <- prophet_reg() %>%
set_engine(engine = "prophet", weekly.seasonality=TRUE) %>%
fit(email ~ date, data = training(splits))
# Model 5: lm ----
model_fit_lm <- linear_reg() %>%
set_engine("lm") %>%
fit(email ~ as.numeric(date) + wday(date),
data = training(splits))Now we’ll add these to a model structure that allows us to more easily handle model evaluation
models_tbl <- modeltime_table(
model_fit_arima_no_boost,
model_fit_arima_boosted,
model_fit_ets,
model_fit_prophet,
model_fit_lm
)
models_tbl# Modeltime Table
# A tibble: 5 x 3
.model_id .model .model_desc
<int> <list> <chr>
1 1 <fit[+]> ARIMA(4,0,0)(0,1,2)[7]
2 2 <fit[+]> ARIMA(1,0,3)(0,1,1)[7] W/ XGBOOST ERRORS
3 3 <fit[+]> ETS(M,N,M)
4 4 <fit[+]> PROPHET
5 5 <fit[+]> LM
Now we’ll evaluate the models on test data and add this data to our model table structure.
# Modeltime Table
# A tibble: 5 x 5
.model_id .model .model_desc .type .calibration_data
<int> <list> <chr> <chr> <list>
1 1 <fit[+]> ARIMA(4,0,0)(0,1,2)[7] Test <tibble [177 × 4…
2 2 <fit[+]> ARIMA(1,0,3)(0,1,1)[7] W/ XGBOOST … Test <tibble [177 × 4…
3 3 <fit[+]> ETS(M,N,M) Test <tibble [177 × 4…
4 4 <fit[+]> PROPHET Test <tibble [177 × 4…
5 5 <fit[+]> LM Test <tibble [177 × 4…
Now we can use this framework to visually inspect our models.
calibration_tbl %>%
modeltime_forecast(
new_data = testing(splits),
actual_data = cs_ts
) %>%
plot_modeltime_forecast(
.legend_max_width = 25, # For mobile screens
.interactive = TRUE
)All models have wide confidence intervals and don’t seem to capture the decreasing trend beginning approximately April 2021. Both ARIMA models capture the weekly patterns well and deviate the least in the period after April 2021.
Now we should also take a look at our error metrics in tabular form. The boosted ARIMA model has the lowest mean absolute error and highest r squared.
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ARIMA(4,0,0)(0,1,2)[7] | Test | 1355.06 | 13.54 | 1.09 | 14.19 | 1808.05 | 0.23 |
| 2 | ARIMA(1,0,3)(0,1,1)[7] W/ XGBOOST ERRORS | Test | 1258.45 | 12.71 | 1.01 | 13.08 | 1705.33 | 0.26 |
| 3 | ETS(M,N,M) | Test | 1958.16 | 21.89 | 1.57 | 18.99 | 2328.88 | 0.24 |
| 4 | PROPHET | Test | 1489.10 | 16.40 | 1.20 | 15.09 | 1984.55 | 0.07 |
| 5 | LM | Test | 1633.22 | 15.70 | 1.31 | 16.90 | 2135.07 | 0.02 |
Now we’ll refit these models to the entire data set and forecast forward 7 months to the end of 2021. All models except for our linear model provide reasonable step ahead forecasts for daily email volume to the end of 2021.
Conclusion / Next Steps
- All models except for our linear models capture the weekly pattern of email volume well and provide reasonable forecasts
- In testing each model the ARIMA & ARIMA Boost models performed the best
- Choose a model considering both accuracy, interpret-ability, and computational cost
- Develop further studies(s) based on model(s) chosen
- Develop computational workflow based on model(s) chosen
Epilogue
As an example, we’ll assume that a consensus has been made to explore the regression model further. To do this the model is isolated and further developed by adding features such as nominal day of week and holiday. From this we can view our model coefficients for a preliminary confirmation of seasonal effects, as well as those of holidays.
# recipe
lm_rec <-
recipe(email ~ ., data = training(splits)) %>%
step_date(date, features = c("dow", "month")) %>%
step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
step_rm(date) %>%
step_dummy(all_nominal(), -all_outcomes())
# model specs
lm_spec <- linear_reg() %>%
set_engine(engine = "lm")
# lm wflow
lm_wflow <-
workflow() %>%
add_model(lm_spec) %>%
add_recipe(lm_rec)
# lm fit
lm_fit <-
lm_wflow %>%
fit(data = training(splits))
# view fit coefs
lm_fit %>%
pull_workflow_fit() %>%
tidy(conf.int = TRUE) %>%
ggplot(aes(term, estimate)) +
geom_point() +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
coord_flip() +
labs(title = "Coefficients of a linear regression model")We can also check to see if the model has improved with the added features. Using our previous model as comparison we can see that our rmse has improved from a previous value of approximately 2130 down to about 1750. This improvement is along the lines of the best time series model in terms of rmse and r2 measures. We now have two viable choices for forecasting, an ARIMA variant as well as a more parsimonious linear model.
# refit on all
last_lm <-
lm_wflow %>%
last_fit(splits)
# view metrics
last_lm %>%
collect_metrics() %>%
select(-.config, -.estimator) %>% knitr::kable()| .metric | .estimate |
|---|---|
| rmse | 1752.1464753 |
| rsq | 0.3553691 |